A New Probabilistic Model of Text Classi cation and
ثبت نشده
چکیده
This paper introduces the multinomial model of text classiication and retrieval. One important feature of the model is that the tf statistic, which usually appears in probabilistic IR models as a heuristic, is an integral part of the model. Another is that the variable length of documents is accounted for, without either making a uniform length assumption or using length normalization. The multinomial model employs independence assumptions which are similar to assumptions made in previous probabilistic models , particularly the binary independence model and the 2-Poisson model. The use of simulation to study the model is described. Performance of the model is evaluated on the TREC-3 routing task. Results are compared with the binary independence model and with the simulation studies.
منابع مشابه
A Comparison of Event Models for Naive Bayes Text Classi cation
Recent approaches to text classi cation have used two di erent rst order probabilistic models for classi ca tion both of which make the naive Bayes assumption Some use a multi variate Bernoulli model that is a Bayesian Network with no dependencies between words and binary word features e g Larkey and Croft Koller and Sahami Others use a multinomial model that is a uni gram language model with i...
متن کاملA New Probabilistic Model of Text Classi cation and Retrieval
This paper introduces the multinomial model of text classiication and retrieval. One important feature of the model is that the tf statistic, which usually appears in probabilistic IR models as a heuristic, is an integral part of the model. Another is that the variable length of documents is accounted for, without either making a uniform length assumption or using length normalization. The mult...
متن کاملTitle: Hierarchically Classifying Documents Using Very Few Words Authors: Hierarchically Classifying Documents Using Very Few Words
The proliferation of topic hierarchies for text documents has resulted in a need for tools that automatically classify new documents within such hierarchies. One can use existing classi ers by ignoring the hierarchical structure, treating the topics as separate classes. Unfortunately, in the context of text categorization, we are faced with a large number of classes and a huge number of relevan...
متن کاملCombining weak learners with probabilistic models for binary classication
In this project it is addressed the problem of binary classi cation i.e. we want to nd a model F : X! f0; 1g for the dependence of a class c on x from a set of samples D = fct;xtgt=1 with ct 2 f0; 1g. The approach is to combine multiple base learners with the use of probabilistic models. In particular, I used the methodology presented by Ghersen eld for doing regression in complex time series[...
متن کاملText Classification from Labeled and Unlabeled Documents Using
This paper shows that the accuracy of learned text classi ers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classi cation problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from lab...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 1996